Skip to content

fix(rag): don't let a broken native dep crash every agent import#1534

Merged
kovtcharov-amd merged 3 commits into
mainfrom
kalin/rag-import-guard
Jun 8, 2026
Merged

fix(rag): don't let a broken native dep crash every agent import#1534
kovtcharov-amd merged 3 commits into
mainfrom
kalin/rag-import-guard

Conversation

@kovtcharov-amd

Copy link
Copy Markdown
Collaborator

Why this matters

Before: if the optional RAG stack was installed but a native library underneath it couldn't load — most often torchcodec/FFmpeg pulled in transitively by sentence-transformers, or an arch-mismatched faissgaia chat and every other agent died at import time with a raw RuntimeError, even though RAG was never used. The crash had nothing to do with what the user was running; one broken wheel took the whole CLI down.

After: a broken optional dependency is treated the same as a missing one — the import succeeds, agents load, and if (and only if) RAG is actually used, RAGSDK raises a loud, actionable error that names the real cause ("installed but failed to load … e.g. a missing FFmpeg for torchcodec") instead of telling the user to reinstall a package they already have.

Root cause: the guard caught only ImportError, but native-load failures raise RuntimeError/OSError.

Test plan

  • python -m pytest tests/unit/rag/test_dependency_guard.py -v — 3 pass
  • import gaia.agents.docqa.agent no longer raises in an environment with broken torchcodec (previously crashed at module load)
  • Instantiating RAGSDK with a broken dep raises ImportError naming the captured cause; a genuinely missing dep gets install instructions only
  • black / isort clean on changed files

A broken native dependency under the optional RAG stack — most commonly
torchcodec/FFmpeg pulled in transitively by sentence-transformers, but
also an arch-mismatched faiss build — raises RuntimeError/OSError at
import, not ImportError. The guard in rag/sdk.py only caught ImportError,
so the exception escaped and killed the entire import chain: importing
gaia.agents.chat (or any agent that transitively imports RAG) died at
module load even when RAG was never used, taking `gaia chat` and friends
down with it.

Broaden the sentence-transformers and faiss guards to treat any import
failure as "not installed", and capture the failure reason so the loud,
deferred error in RAGSDK._check_dependencies() distinguishes "not
installed" (reinstall) from "installed but broken" (fix the underlying
native dep, e.g. FFmpeg) instead of misdirecting the user.
@github-actions github-actions Bot added rag RAG system changes tests Test changes performance Performance-critical changes labels Jun 8, 2026
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Summary

Solid, well-scoped fix for a real problem: a broken native dep under sentence-transformers (torchcodec/FFmpeg) or an arch-mismatched faiss used to crash every agent at import time, even when RAG was never touched. Widening the guard from except ImportError to except Exception and deferring the loud error to _check_dependencies() is the right shape, and the comments explain the why clearly.

One blocking issue: the secondary goal — telling "not installed" apart from "installed but broken" — doesn't actually work for the common case. Because the guard captures ModuleNotFoundError (a subclass of ImportError) into _*_IMPORT_ERROR too, a genuinely missing package gets reported under the "installed but failed to load — reinstalling won't help" banner, which is the exact misdirection the PR set out to fix. The new test passes only because it hand-sets the error to None rather than exercising the real import path.

Issues Found

🟡 Important — genuinely-missing dep is mislabeled "installed but failed to load" (src/gaia/rag/sdk.py:41-45)

except Exception catches ModuleNotFoundError (an ImportError), so when a package is simply not installed, _*_IMPORT_ERROR is still truthy. _check_dependencies (sdk.py:242-247) then adds it to broken and prints "installed but failed to load — reinstalling won't help until the underlying error is fixed" — telling a user whose fix is pip install not to bother. That's the most common path (anyone without the [rag] extras who triggers RAG), and it directly contradicts the PR's stated purpose.

Fix: only capture the cause when it is not an ImportError. For the sentence-transformers block:

_SENTENCE_TRANSFORMERS_IMPORT_ERROR = None
try:
    from sentence_transformers import SentenceTransformer
except ImportError:
    SentenceTransformer = None
except Exception as _e:  # pylint: disable=broad-except
    SentenceTransformer = None
    _SENTENCE_TRANSFORMERS_IMPORT_ERROR = _e

And the same shape for the faiss block (sdk.py:24-30):

_FAISS_IMPORT_ERROR = None
try:
    import faiss
except ImportError:
    faiss = None
except Exception as _e:  # pylint: disable=broad-except
    faiss = None
    _FAISS_IMPORT_ERROR = _e

With this, genuinely-missing → _*_IMPORT_ERROR stays None → the broken-section is omitted (which is exactly what test_genuinely_missing_dep_omits_broken_section simulates), and only true native-load failures (RuntimeError/OSError) get the "reinstalling won't help" hint.

🟢 Minor — "genuinely missing" test doesn't exercise the real capture path (tests/unit/rag/test_dependency_guard.py:112)

test_genuinely_missing_dep_omits_broken_section manually patches _SENTENCE_TRANSFORMERS_IMPORT_ERROR to None, which is not the state the import block produces for a missing package (it produces a ModuleNotFoundError). That's why the test stays green despite the bug above. After applying the import fix, consider also asserting the import block itself yields None for an ImportError cause (e.g. simulate by re-running the try/except logic with a patched importer), so the regression is actually pinned rather than assumed.

Strengths

  • Correct diagnosis and minimal blast radius — the import-crash fix itself is sound: import succeeds, agents load, and the error is deferred to point-of-use in __init__ via _check_dependencies().
  • The except Exception here is justified (KeyboardInterrupt/SystemExit derive from BaseException and stay uncaught), and the pylint: disable is appropriately narrow.
  • Comments explain the why (native-load failures raise RuntimeError/OSError, not ImportError) in one tight block rather than narrating the what — matches the repo's comment guidance.
  • Test file is focused and readable, and the module-imports-without-deps test directly pins the regression that matters most.

Verdict

Request changes — the core import-crash fix is good to go, but the missing-vs-broken distinction misfires on the common "not installed" path and prints actively misleading guidance. The fix is two small per-import edits (above) plus tightening the one test; happy to re-approve once applied.

…lint

The module-level error-capture assignments tripped pylint C0413
(wrong-import-position). Recover the failure reason inside
_check_dependencies() via importlib instead, keeping the module's
import block clean. Behaviour is unchanged.
@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

🟡 tests/unit/rag/test_dependency_guard.py — tests patch the wrong hook and don't exercise the code they claim to cover.

_check_dependencies() calls importlib.import_module(pkg), but both new tests intercept builtins.__import__. Those two entry points are independent: importlib.import_module goes straight to _bootstrap._gcd_import and never touches builtins.__import__. The fake RuntimeError/ImportError is never raised, broken stays empty, and test_broken_install_reports_actionable_cause will fail on any CI node where sentence-transformers isn't installed (the assert "installed but failed to load" in msg assertion won't be satisfied).

Fix: patch sdk.importlib.import_module instead of builtins.__import__:

    def fake_import(name, *args, **kwargs):
        if name == "sentence_transformers":
            raise RuntimeError("Could not load libtorchcodec (FFmpeg not found)")
        return real_import_module(name)

    real_import_module = importlib.import_module
    monkeypatch.setattr(sdk.importlib, "import_module", fake_import)

(Same pattern for the test_genuinely_missing_dep_omits_broken_section test — swap builtins.__import__ for sdk.importlib.import_module.)

itomek
itomek previously approved these changes Jun 8, 2026
…overable

importlib.import_module bypasses builtins.__import__, so the recovery
path that names an installed-but-broken dependency could never be
exercised by the dependency-guard tests (they intercept imports), and
in a deps-absent CI environment the broken-cause section was never
produced — failing test_broken_install_reports_actionable_cause. Use
the __import__ builtin to re-run the real import, which both surfaces
the native load error in production and is interceptable in tests.
@kovtcharov-amd kovtcharov-amd added this pull request to the merge queue Jun 8, 2026
Merged via the queue into main with commit de7c501 Jun 8, 2026
38 checks passed
@kovtcharov-amd kovtcharov-amd deleted the kalin/rag-import-guard branch June 8, 2026 22:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance-critical changes rag RAG system changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants